Applied Generative AI for AI Developers
RAG = Retrieval-Augmented Generation
A generative AI approach where the model combines external knowledge retrieval with text generation to provide more accurate and contextually rich responses.
Augments LLM responses with relevant context: Instead of relying solely on the LLM’s training data, RAG retrieves and incorporates specific, up-to-date information into responses.
Helps ground responses in factual information: By providing relevant context from trusted sources, RAG bases responses on retrieved facts rather than on the model’s internal recall alone.
Reduces hallucinations: With access to specific, retrieved information, the model is less likely to generate incorrect or fabricated responses.
Enables use of private/proprietary data: Organizations can leverage their internal documents, knowledge bases, and proprietary information that wasn’t part of the LLM’s training data.
Provides source attribution: RAG systems can track where information comes from, making responses more transparent and verifiable.
Key Components:
Prepare documents: Clean and preprocess your source documents, removing irrelevant content and standardizing format.
Create embeddings: Convert text chunks into numerical vectors using embedding models like BGE-large-en-v1.5 (available on Hugging Face), Amazon Titan embeddings, OpenAI’s text-embedding-ada-002, or Cohere’s embed-multilingual.
Store in vector database: Upload embeddings to a vector store like Pinecone, Weaviate, or FAISS for efficient similarity search.
Process user query: Convert the user’s question into an embedding using the same embedding model.
Retrieve relevant context: Perform similarity search to find the most relevant document chunks.
Generate response: Combine retrieved context with an LLM prompt to generate an accurate, contextual response.
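The six steps above can be sketched end to end. This is a minimal toy, not a production pipeline: the `embed` function here is a simple bag-of-words vector over a fixed vocabulary standing in for a real embedding model, and the in-memory list stands in for a vector database. All names and sample documents are illustrative.

```python
import math
import re
from collections import Counter

def embed(text, vocab):
    """Toy bag-of-words embedding over a fixed vocabulary.
    A real system would call an embedding model here instead."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return [float(counts[w]) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Steps 1-2: prepare documents and create embeddings
docs = [
    "RAG retrieves external context before generation.",
    "Bananas are rich in potassium.",
]
vocab = sorted({w for d in docs for w in re.findall(r"\w+", d.lower())})
index = [(doc, embed(doc, vocab)) for doc in docs]  # step 3: toy "vector store"

# Steps 4-5: embed the query with the SAME model, then similarity-search
query = "What does RAG retrieve?"
q_vec = embed(query, vocab)
context = max(index, key=lambda pair: cosine(q_vec, pair[1]))[0]

# Step 6: combine retrieved context with the prompt sent to the LLM
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The key invariant, whatever the embedding model, is in steps 4-5: the query must be embedded with the same model as the documents, or the similarity scores are meaningless.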
Reference: Chunking techniques with LangChain and LlamaIndex
Document segmentation approaches: Choose between fixed-size chunks, semantic chunking, or paragraph-based splitting depending on your content structure.
Chunk size considerations: Balance between too large (dilutes relevance) and too small (loses context); typically 256-1024 tokens works well.
Overlap between chunks: Include some overlap (10-20%) between consecutive chunks to maintain context across boundaries.
Maintaining context: Preserve important metadata and hierarchical information when splitting documents.
Structured vs unstructured data: Adapt chunking strategy based on whether you’re dealing with free text, tables, or structured documents.
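A fixed-size splitter with overlap, the simplest of the approaches above, can be sketched as follows. For simplicity this counts words rather than tokens (a real splitter, like those in LangChain or LlamaIndex, would use a tokenizer); the default sizes are illustrative.

```python
def chunk_text(text, chunk_size=20, overlap=4):
    """Split text into fixed-size word chunks with overlap between
    consecutive chunks to preserve context across boundaries."""
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already covers the tail of the text
    return chunks

# 50 words, window of 20, overlap of 4 -> windows start at 0, 16, 32
chunks = chunk_text(" ".join(str(i) for i in range(50)), chunk_size=20, overlap=4)
```

Note how the overlap shows up: the last `overlap` words of each chunk reappear at the start of the next one, which is what keeps a sentence straddling a boundary retrievable from at least one chunk.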
Key Considerations:
Model selection criteria: Consider factors like accuracy, speed, cost, and dimension size when choosing an embedding model.
Dimensionality impact: Higher dimensions can capture more information but increase storage costs and retrieval time.
Multi-lingual support: Choose models like Cohere multilingual or Amazon Titan if your application needs to handle multiple languages.
Domain-specific needs: Consider fine-tuning embedding models for specialized domains like medical or legal text. Fine-tuning can be done with the Sentence Transformers library.
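The dimensionality trade-off above is easy to quantify with a back-of-envelope calculation. The corpus size and dimensions below are illustrative, and the formula covers only raw float32 vector storage, ignoring index overhead, metadata, and replication:

```python
def vector_storage_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for a corpus of float32 embeddings.
    Ignores index structures, metadata, and replication overhead."""
    return num_vectors * dim * bytes_per_value

# Hypothetical corpus of 10M chunks: a 1024-dim model vs a 384-dim model
large = vector_storage_bytes(10_000_000, 1024)  # ~41 GB raw
small = vector_storage_bytes(10_000_000, 384)   # ~15 GB raw
```

The gap compounds: higher dimensions also mean proportionally more bytes moved per query, which is why retrieval latency tends to grow with dimension as well as with corpus size.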
Features to Consider:
Scalability: Ability to handle millions or billions of vectors efficiently.
Query performance: Fast similarity search with support for approximate nearest neighbors (ANN) algorithms.
Similarity search algorithms: Support for different distance metrics (cosine, euclidean) and indexing methods.
Metadata filtering: Ability to combine vector similarity search with metadata filters.
Cost considerations: Balance between hosting costs, query costs, and storage requirements.
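The metadata-filtering feature above can be illustrated with a brute-force search over an in-memory index. The filter-then-rank interface mimics what managed vector stores expose; the vectors, IDs, and `where` parameter name here are illustrative, and a production store would replace the linear scan with an ANN index (e.g. HNSW or IVF).

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(index, q_vec, top_k=2, where=None):
    """Similarity search with an optional exact-match metadata filter:
    drop non-matching items first, then rank the rest by cosine similarity."""
    hits = [
        item for item in index
        if where is None
        or all(item["meta"].get(k) == v for k, v in where.items())
    ]
    hits.sort(key=lambda item: cosine(q_vec, item["vec"]), reverse=True)
    return hits[:top_k]

index = [
    {"id": "a", "vec": [1.0, 0.0], "meta": {"lang": "en"}},
    {"id": "b", "vec": [0.9, 0.1], "meta": {"lang": "de"}},
    {"id": "c", "vec": [0.0, 1.0], "meta": {"lang": "en"}},
]
results = search(index, [1.0, 0.0], top_k=1, where={"lang": "en"})
```

Here document "b" is the second-closest vector but is excluded by the language filter, so the top hit is "a": filtering happens before ranking, not after.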
Advanced Techniques:
Query rewriting: Reformulate user queries to improve retrieval performance, often using an LLM to generate better search terms.
Query expansion: Generate multiple variations of the query to increase the chance of finding relevant information.
Entity extraction: Identify and use key entities from the query for more focused retrieval.
Hybrid search: Combine semantic (embedding-based) and lexical (keyword-based) search for better results.
Query decomposition: Break complex queries into simpler sub-queries that can be processed independently.
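For hybrid search, one common way to merge the semantic and lexical result lists is Reciprocal Rank Fusion (RRF), sketched below. The document IDs and rankings are illustrative; `k=60` is the constant conventionally used in the RRF literature.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each ranked list contributes 1/(k + rank)
    per document; documents ranked well in several lists rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d3", "d1", "d2"]  # embedding-based ranking
keyword = ["d1", "d4", "d3"]   # lexical (e.g. BM25) ranking
fused = rrf([semantic, keyword])
```

RRF only looks at ranks, never at the raw scores, which sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales. Here "d1" wins because it ranks well in both lists, even though neither list puts it first alone... actually the keyword list does; the point is that "d1" beats "d3" on combined rank evidence.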
Metrics:
Mean Reciprocal Rank (MRR): Measures how early in the results list the first relevant document appears.
Mean Average Precision: Evaluates precision at different recall levels, providing a single score for overall retrieval quality.
Normalized Discounted Cumulative Gain: Measures the quality of ranking by considering both relevance and position.
Recall: Assesses whether all relevant documents are retrieved from the collection.
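Two of these metrics, MRR and recall@k, are simple enough to compute by hand; a sketch follows. The queries and document IDs are made-up examples:

```python
def mrr(ranked_lists, relevant_sets):
    """Mean Reciprocal Rank: average over queries of 1/rank of the
    first relevant document in that query's result list."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break  # only the FIRST relevant hit counts for MRR
    return total / len(ranked_lists)

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

# Query 1: first relevant doc at rank 2; query 2: at rank 1.
# MRR = (1/2 + 1/1) / 2 = 0.75
score = mrr([["a", "b", "c"], ["x", "y"]], [{"b"}, {"x"}])
```

Note the division of labor: MRR rewards putting one relevant document early, while recall@k asks whether all relevant documents made the cut, so a retriever can score well on one and poorly on the other.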
Popular Frameworks:
LlamaIndex: Provides high-level abstractions for building RAG applications with features like structured data handling and custom retrievers.
Haystack: Offers modular components for building production-ready search and RAG systems with strong evaluation capabilities.
LangChain: Enables building complex chains of operations for RAG applications with extensive integration options.
Amazon Bedrock Prompt Flows: Managed service for building and deploying RAG applications with integration to AWS services.
RAG full picture
| Category | Link | Description |
|---|---|---|
| RAG | https://arxiv.org/pdf/2005.11401 | Paper: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks |
| RAG | https://arxiv.org/pdf/2412.15605v1 | Paper: Cache-augmented generation (CAG) as an alternative to RAG |
| RAG | https://arxiv.org/pdf/2401.15884 | Paper: Corrective Retrieval Augmented Generation |
| RAG | https://arxiv.org/html/2409.13731v3 | Paper: KAG: Boosting LLMs in Professional Domains via Knowledge Augmented Generation |
| RAG | https://x.com/akshay_pachaar/status/1875520939536142656 | Traditional RAG Vs Graph RAG |
| RAG | https://www.dailydoseofds.com/p/traditional-rag-vs-hyde/ | Traditional RAG Vs HyDE |
| RAG | https://www.theunwindai.com/p/build-a-corrective-rag-agent | Build a corrective RAG application |
| RAG | https://x.com/Aurimas_Gr/status/1879148810158452777 | Challenges and components of production-grade RAG AI systems |
| RAG | https://x.com/akshay_pachaar/status/1879154648327811134 | Building a multi-tenant RAG app with easy integrations |
| RAG | https://x.com/akshay_pachaar/status/1878916141122462139 | MemoRAG enhances RAG with long-term memory capabilities |